-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: GH2003 Series.isin for categorical dtypes #20522
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to move your categorical isin logic to a new method Categorical.isin
.
Then in algorithms.py
you can do
if is_categorical_dtype(values):
values = getattr(values, '_values', values) # extract Categorical from Series/Index
# your code
Can you add an ASV for this?
@@ -3507,7 +3507,11 @@ def isin(self, values): | |||
5 False | |||
Name: animal, dtype: bool | |||
""" | |||
result = algorithms.isin(com._values_from_object(self), values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the _values_from_object
can be moved to algorithms.isin
? Then this could just be result = algorithms.isin(self, values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes let's try to do this, @Ma3aXaKa can you make this change
@@ -1255,6 +1255,17 @@ def test_isin_empty(self, empty): | |||
result = s.isin(empty) | |||
tm.assert_series_equal(expected, result) | |||
|
|||
def test_isin_cats(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can go in pandas/tests/categorical/test_algos.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
To be a bit clearer about this. I'd imagine the |
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -345,6 +345,7 @@ Other Enhancements | |||
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`) | |||
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`) | |||
- zip compression is supported via ``compression=zip`` in :func:`DataFrame.to_pickle`, :func:`Series.to_pickle`, :func:`DataFrame.to_csv`, :func:`Series.to_csv`, :func:`DataFrame.to_json`, :func:`Series.to_json`. (:issue:`17778`) | |||
- Performance enhancement for :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to the "Performance Improvements" section? (starts around line 783).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@TomAugspurger Why do I need to call |
Sorry, I meant `isin` :)
…On Wed, Mar 28, 2018 at 5:30 PM, Artem Bogachev ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> Why do I need to call algorithms.value_counts(codes,
code_values) in Categorical.isin
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20522 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIhSHMMtDX_sbWG8bTcnhm9cg6tDVks5tjA8SgaJpZM4S_K1T>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have an associated issue? this would need asv benchmarks
pandas/core/algorithms.py
Outdated
@@ -403,8 +403,15 @@ def isin(comps, values): | |||
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)): | |||
values = construct_1d_object_array_from_listlike(list(values)) | |||
|
|||
comps, dtype, _ = _ensure_data(comps) | |||
values, _, _ = _ensure_data(values, dtype=dtype) | |||
if not is_categorical_dtype(comps): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse this logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I am working on asv benchmark
Codecov Report
@@ Coverage Diff @@
## master #20522 +/- ##
==========================================
+ Coverage 91.77% 91.77% +<.01%
==========================================
Files 153 153
Lines 49257 49270 +13
==========================================
+ Hits 45207 45220 +13
Misses 4050 4050
Continue to review full report at Codecov.
|
@TomAugspurger I have moved the logic to And I cannot understand why checks on CircleCI have failed. The same tests on my laptop are passing. |
Merged in master. That'll hopefully fix the circleCI failure. |
@Ma3aXaKa did you run the ASV benchmarks? Or just a simple |
It looks like it can be just about anything. Did you run into issues without using |
Yep. I have run the |
Sure (just the new one). I'm curious to see the performance improvement.
…On Mon, Apr 2, 2018 at 4:50 PM, Artem Bogachev ***@***.***> wrote:
Yep. I have run the asv benchmark. I got the positive result. Do I need
to post them here?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20522 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvItwyqche6UatcRKiPvpfFxiyIPks5tkp0JgaJpZM4S_K1T>
.
|
Can you also add a release note (doc/source/whatsnew/v0.23.0.txt) noting the performance improvement? Make sure to pull my commit first. |
I've already added the line about the performance improvement to |
Sorry, missed that. |
That's what I got on my laptop from
|
pandas/core/series.py
Outdated
@@ -3564,7 +3564,10 @@ def isin(self, values): | |||
5 False | |||
Name: animal, dtype: bool | |||
""" | |||
result = algorithms.isin(com._values_from_object(self), values) | |||
if is_categorical_dtype(self): | |||
result = self._values.isin(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a more ducklike check would work here, something like
if hasattr(self._values, 'isin'):
result = self._values.isin(values)
else:
......
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback What's the purpose? Is it better for performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its more generic. you don't have to know its a categorical here.
@jreback Changed the condition to |
I would prefer to convert |
use np.asarray as |
Hello @Ma3aXaKa! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 25, 2018 at 10:02 Hours UTC |
@TomAugspurger @jreback I have added the docs and fixed the problem with Python 2 (thanks to @TomAugspurger). Is there something else to be done? |
|
||
def test_isin_cats(): | ||
cat = pd.Categorical(["a", "b", np.nan]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add the issue number here as a comment
asv_bench/benchmarks/categoricals.py
Outdated
@@ -148,3 +148,18 @@ def time_rank_int_cat(self): | |||
|
|||
def time_rank_int_cat_ordered(self): | |||
self.s_int_cat_ordered.rank() | |||
|
|||
|
|||
class IsIn(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u make Isin
@@ -407,6 +407,12 @@ def isin(comps, values): | |||
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)): | |||
values = construct_1d_object_array_from_listlike(list(values)) | |||
|
|||
if is_categorical_dtype(comps): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u change this to is_extension_type does everything still work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, can you add a note: TODO(extension) here
asv_bench/benchmarks/categoricals.py
Outdated
self.sample = np.random.choice(arr, sample_size) | ||
self.ts = pd.Series(arr).astype('category') | ||
|
||
def time_isin_categorical_strings(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 4 cases in the original issue can you cover them
@@ -407,6 +407,12 @@ def isin(comps, values): | |||
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)): | |||
values = construct_1d_object_array_from_listlike(list(values)) | |||
|
|||
if is_categorical_dtype(comps): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, can you add a note: TODO(extension) here
pandas/core/arrays/categorical.py
Outdated
raise TypeError("only list-like objects are allowed to be passed" | ||
" to isin(), you passed a [{values_type}]" | ||
.format(values_type=type(values).__name__)) | ||
from pandas.core.series import _sanitize_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move the import to top of function
tm.assert_numpy_array_equal(expected, result) | ||
|
||
result = cat.isin(["a", "c"]) | ||
expected = np.array([True, False, False], dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a couple of tests in pandas/tests/test_algos.py that test cats for isin, can you move here
@jreback I've made requested changes except moving tests to a different file (see comments). But the Travis CI build is failing with an S3 error. |
@jreback There are still errors with S3. Is this OK? |
I'm looking into the s3 / moto issues today. Things should be OK on your
end.
…On Wed, Apr 18, 2018 at 8:10 AM, Artem Bogachev ***@***.***> wrote:
@jreback <https://github.com/jreback> There are still errors with S3. Is
this OK?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20522 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIol8ciUhkCI4qEyopTCeyK5eakb2ks5tpztDgaJpZM4S_K1T>
.
|
@TomAugspurger Hi. Did you fix the issue with S3? Is there something left to do before merging the. ranch? |
@Ma3aXaKa can you rebase |
rebased, so let's merge on green. |
thanks @Ma3aXaKa what looked like an easy issue morphed into code refactoring! thanks for the patch and patience! keep em coming! |
I have added a branching for the categorical case in
Series.isin
function.I have also added a test for the most crucial cases (nans).
closes #20003